Experiments in Discriminating Similar Languages
نویسندگان
چکیده
We describe the system built by the National Research Council (NRC) Canada for the 2015 shared task on Discriminating between similar languages. The NRC system uses various statistical classifiers trained on character and word ngram features. Predictions rely on a two-stage process: we first predict the language group, then discriminate between languages or variants within the group. This year, we focused on two issues: 1) the ngram generation process, and 2) the handling of the anonymized (“blinded”) Named Entities. Despite the slightly harder experimental conditions this year, our systems achieved an average accuracy of 95.24% (closed task) and 95.65% (open task), ending up second or (close) third on the closed task, and first on the open task.
منابع مشابه
در کاربرد تشخیص زبان گفتاری GMM-VSM در قالب سیستم GMM
GMM is one of the most successful models in the field of automatic language identification. In this paper we have proposed a new model named adapted weight GMM (AW-GMM). This model is similar to GMM but the weights are determined using GMM-VSM LID system based on the power of each component in discriminating one language from the others. Also considering the computational complexity of GMM-VSM,...
متن کاملDiscriminating Similar Languages: Evaluations and Explorations
We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using e...
متن کاملDiscriminating between Similar Languages with Word-level Convolutional Neural Networks
Discriminating between Similar Languages (DSL) is a challenging task addressed at the VarDial Workshop series. We report on our participation in the DSL shared task with a two-stage system. In the first stage, character n-grams are used to separate language groups, then specialized classifiers distinguish similar language varieties. We have conducted experiments with three system configurations...
متن کاملDiscriminating similar languages: experiments with linear SVMs and neural networks
This paper describes the systems we experimented with for participating in the discriminating between similar languages (DSL) shared task 2016. We submitted results of a single system based on support vector machines (SVM) with linear kernel and using character ngram features, which obtained the first rank at the closed training track for test set A. Besides the linear SVM, we also report addit...
متن کاملMerging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection
This paper presents the compilation of the DSL corpus collection created for the DSL (Discriminating Similar Languages) shared task to be held at the VarDial workshop at COLING 2014. The DSL corpus collection were merged from three comparable corpora to provide a suitable dataset for automatic classification to discriminate similar languages and language varieties. Along with the description of...
متن کامل